Aim: To build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high.

Context: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content: The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Acknowledgements: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

Dataset taken from https://www.kaggle.com/uciml/pima-indians-diabetes-database

In [2]:
df = df_n.rename({'6': 'pregnancies', '148': 'glucose','72': 'bloodPressure', '35': 'skinThickness','0': 'insulin', '33.6': 'bmi','0.627': 'diabetesPedigreeFunction', '50': 'age','1': 'outcome'}, axis=1)
df.head(5)
Out[2]:
pregnancies glucose bloodPressure skinThickness insulin bmi diabetesPedigreeFunction age outcome
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
In [3]:
# get info about data: column names and number of rows
display(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pregnancies               767 non-null    int64  
 1   glucose                   767 non-null    int64  
 2   bloodPressure             767 non-null    int64  
 3   skinThickness             767 non-null    int64  
 4   insulin                   767 non-null    int64  
 5   bmi                       767 non-null    float64
 6   diabetesPedigreeFunction  767 non-null    float64
 7   age                       767 non-null    int64  
 8   outcome                   767 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
In [4]:
# data in figures
df.hist(figsize=(12,8),bins=50)
plt.tight_layout()
In [6]:
fig=go.Figure()
fig.add_trace(go.Box(y=pos["age"],name="pos",marker_color="blue",boxpoints="all",whiskerwidth=0.3))
fig.add_trace(go.Box(y=neg["age"],name="neg",marker_color="#e75480",boxpoints="all",whiskerwidth=0.3))
fig.update_layout(template="seaborn",title="Outcome Distribution by age",height=60)
fig.show()
In [7]:
plt.figure(figsize=(15, 6))
features = ['age','pregnancies','insulin','bmi','glucose','skinThickness','bloodPressure','diabetesPedigreeFunction',
       'outcome']
corr = df[features].corr()
sns.heatmap(corr, square = True, annot=True, linewidths = 0.5, vmax = 0.2)
Out[7]:
<AxesSubplot:>
In [8]:
C=abs(corr["outcome"]).sort_values(ascending=False)[1:]
print(C)
glucose                     0.465856
bmi                         0.292695
age                         0.236417
pregnancies                 0.221087
diabetesPedigreeFunction    0.173245
insulin                     0.131984
skinThickness               0.073265
bloodPressure               0.064882
Name: outcome, dtype: float64

Most correlation for positive outcome: glucose, bmi and pregnancies. We can now look at the correlation between variables in more detail and also correlation between outcome and attributes

In [9]:
c = sns.jointplot(df['age'], df['glucose'], kind='reg')
c.fig.suptitle("Correlation between age and glucose")
Out[9]:
Text(0.5, 0.98, 'Correlation between age and glucose')
In [10]:
pg = sns.jointplot(df['pregnancies'], df['glucose'], kind='reg')
pg.fig.suptitle("Correlation pregnancies and glucose")
Out[10]:
Text(0.5, 0.98, 'Correlation pregnancies and glucose')
In [11]:
tr = sns.jointplot(df['bmi'], df['glucose'], kind='reg')
tr.fig.suptitle("Correlation between bmi and insulin")
Out[11]:
Text(0.5, 0.98, 'Correlation between bmi and insulin')
In [12]:
th = sns.jointplot(df['diabetesPedigreeFunction'], df['glucose'], kind='reg')
th.fig.suptitle("Correlation between bmi and blood pressure")
Out[12]:
Text(0.5, 0.98, 'Correlation between bmi and blood pressure')
In [13]:
ax = sns.violinplot(x="outcome", y="glucose", data=df, inner=None)
ax = sns.swarmplot(x="outcome", y="glucose", data=df,
                   color="white", edgecolor="gray")
C:\Users\SoumyaKrishnamurthy\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:

28.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

In [14]:
ax = sns.violinplot(x="outcome", y="bmi", data=df, inner=None)
ax = sns.stripplot(x="outcome", y="bmi", data=df,
                   color="white", edgecolor="gray")
In [15]:
ax = sns.violinplot(x="outcome", y="diabetesPedigreeFunction", data=df, inner=None)
ax = sns.stripplot(x="outcome", y="diabetesPedigreeFunction", data=df,
                   color="white", edgecolor="gray")
In [16]:
sns.barplot(x="outcome", y="pregnancies", data=df)
Out[16]:
<AxesSubplot:xlabel='outcome', ylabel='pregnancies'>
In [17]:
ax = sns.violinplot(x="outcome", y="diabetesPedigreeFunction", data=df, inner=None)
ax = sns.swarmplot(x="outcome", y="diabetesPedigreeFunction", data=df,
                   color="white", edgecolor="gray")
C:\Users\SoumyaKrishnamurthy\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:

38.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\SoumyaKrishnamurthy\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:

9.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

Since this is a 0/1 prediction problem, we will be using Logistic Regression.

Now we split the data (account for oversampling) to training and testing datasets. The data will be normalized and we can apply machine learning algorithms ( I have tried 9 algorithms to compare)

The aim is to find the most accurate model.

In [20]:
start = time.time()
model_Log= LogisticRegression(random_state=10)
model_Log.fit(X_train,Y_train)
Y_pred= model_Log.predict(X_test)
end=time.time()
model_Log_time=end-start
model_Log_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of Logistic Regression model: {round((model_Log_time),5)} seconds\n")
#Plot and compute metrics
compute(Y_pred,Y_test)
Execution time of Logistic Regression model: 0.0286 seconds

Precision: 0.653 
Recall: 0.932 
F1-Score: 0.768 
Accuracy: 71.0 %
Mean Square Error: 0.29
In [21]:
start=time.time()
model_KNN = KNeighborsClassifier(n_neighbors=15)
model_KNN.fit(X_train,Y_train)
Y_pred = model_KNN.predict(X_test)
end=time.time()
model_KNN_time = end-start
model_KNN_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of KNN model: {round((model_KNN_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of KNN model: 0.01687 seconds
Precision: 0.669 
Recall: 0.883 
F1-Score: 0.762 
Accuracy: 71.5 %
Mean Square Error: 0.285
In [22]:
start=time.time()
model_RF = RandomForestClassifier(n_estimators=300,criterion="gini",random_state=5,max_depth=100)
model_RF.fit(X_train,Y_train)
Y_pred=model_RF.predict(X_test)
end=time.time()
model_RF_time=end-start
model_RF_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of RandomForestClassifier: {round((model_RF_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of RandomForestClassifier: 1.06807 seconds
Precision: 0.704 
Recall: 0.854 
F1-Score: 0.772 
Accuracy: 74.0 %
Mean Square Error: 0.26
In [23]:
start=time.time()
model_tree=DecisionTreeClassifier(random_state=10,criterion="gini",max_depth=100)
model_tree.fit(X_train,Y_train)
Y_pred=model_tree.predict(X_test)
end=time.time()
model_tree_time=end-start
model_tree_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of DecisionTreeClassifier: {round((model_tree_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of DecisionTreeClassifier: 0.01006 seconds
Precision: 0.615 
Recall: 0.728 
F1-Score: 0.667 
Accuracy: 62.5 %
Mean Square Error: 0.375
In [24]:
start=time.time()
model_svm=SVC(kernel="rbf")
model_svm.fit(X_train,Y_train)
Y_pred=model_svm.predict(X_test)
end=time.time()
model_svm_time=end-start
model_svm_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of SVC model: {round((model_svm_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of SVC model: 0.09016 seconds
Precision: 0.651 
Recall: 0.922 
F1-Score: 0.763 
Accuracy: 70.5 %
Mean Square Error: 0.295
In [25]:
start=time.time()
model_ADA=AdaBoostClassifier(learning_rate= 0.15,n_estimators= 25)
model_ADA.fit(X_train,Y_train)
Y_pred= model_ADA.predict(X_test)
end=time.time()
model_ADA_time=end-start
model_ADA_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of AdaBoostClassifier: {round((model_ADA_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of AdaBoostClassifier: 0.1033 seconds
Precision: 0.677 
Recall: 0.835 
F1-Score: 0.748 
Accuracy: 71.0 %
Mean Square Error: 0.29
In [26]:
start=time.time()
model_GB= GradientBoostingClassifier(random_state=10,n_estimators=20,learning_rate=0.29,loss="deviance")
model_GB.fit(X_train,Y_train)
Y_pred= model_GB.predict(X_test)
end=time.time()
model_GB_time=end-start
model_GB_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of GradientBoostingClassifier: {round((model_GB_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of GradientBoostingClassifier: 0.06016 seconds
Precision: 0.662 
Recall: 0.854 
F1-Score: 0.746 
Accuracy: 70.0 %
Mean Square Error: 0.3
In [27]:
from xgboost import XGBClassifier
start=time.time()
model_xgb = XGBClassifier(objective='binary:logistic',learning_rate=0.1, max_depth=1, n_estimators = 50,colsample_bytree = 0.5,use_label_encoder=False, eval_metric='mlogloss')
model_xgb.fit(X_train,Y_train)
Y_pred = model_xgb.predict(X_test)
end=time.time()
model_xgb_time=end-start
model_xgb_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of model: {round((model_xgb_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of model: 2.02463 seconds
Precision: 0.669 
Recall: 0.845 
F1-Score: 0.747 
Accuracy: 70.5 %
Mean Square Error: 0.295
In [28]:
start=time.time()
model_gnb = GaussianNB()
model_gnb.fit(X_train,Y_train)
Y_pred = model_xgb.predict(X_test)
end=time.time()
model_gnb_time=end-start
model_gnb_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy

print(f"Execution time of model: {round((model_gnb_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of model: 0.04227 seconds
Precision: 0.669 
Recall: 0.845 
F1-Score: 0.747 
Accuracy: 70.5 %
Mean Square Error: 0.295
In [29]:
#Plot accuracies
accuracies={"Logistic regression": model_Log_accuracy,
            "KNN": model_KNN_accuracy,
            "SVM": model_svm_accuracy,
            "Decision Tree": model_tree_accuracy,
            "Random Forest": model_RF_accuracy,
            "Ada Boost": model_ADA_accuracy,
            "Gradient Boosting": model_GB_accuracy,
             "XG Boost": model_xgb_accuracy,
            "Naive Bayes": model_gnb_accuracy}

acc_list=accuracies.items()
k,v = zip(*acc_list) 
temp=pd.DataFrame(index=k,data=v,columns=["Accuracy"])
temp.sort_values(by=["Accuracy"],ascending=False,inplace=True)
print(temp)
                     Accuracy
Random Forest            74.0
KNN                      71.5
Logistic regression      71.0
Ada Boost                71.0
SVM                      70.5
XG Boost                 70.5
Naive Bayes              70.5
Gradient Boosting        70.0
Decision Tree            62.5
In [30]:
exe_time={"Logistic regression": model_Log_time,
            "KNN": model_KNN_time,
            "SVM": model_svm_time,
            "Decision Tree": model_tree_time,
            "Random Forest": model_RF_time,
            "Ada Boost": model_ADA_time,
            "Gradient Boosting": model_GB_time,
            "XG Boost": model_xgb_time,
            "Naive Bayes": model_gnb_time}

time_list=exe_time.items()
k,v = zip(*time_list) 
temp1=pd.DataFrame(index=k,data=v,columns=["Time"])
temp1.sort_values(by=["Time"],ascending=True,inplace=True)
print(temp1)
                         Time
Decision Tree        0.010059
KNN                  0.016870
Logistic regression  0.028605
Naive Bayes          0.042267
Gradient Boosting    0.060158
SVM                  0.090162
Ada Boost            0.103304
Random Forest        1.068065
XG Boost             2.024626

KNN is the best algorithm for both Accuracy and Time.

Execution time of KNN model: 0.00996 seconds Precision: 0.669 Recall: 0.903 F1-Score: 0.769 Accuracy: 72.0 % Mean Square Error: 0.28

F1-score should be close to 1 and Mean Squared error close to 0 Obviously there is plenty of room for improvement and academics have tuned this model by using better data and more attributes. It illutrates that not all machine learning algorithms can give very accurate results